The California Teachers Study (CTS) is an observational cohort study that collects health and lifestyle information from 133,477 women in the CalSTRS system in order to study risk factors for cancer and ways to control disease. The CTS self-reported questionnaire data, with a focus on the OSHPD hospitalization records, serve as the inaugural data set for this project. The overall objective is to predict the short-term risk of death from the baseline questionnaire characteristics and hospital admission records. The specific aims are:
To develop the best-fitting model for predicting the risk of death within a 30-day window, based on baseline characteristics and hospitalization records.
To identify which comorbidities and causes of death are important in predicting the risk of death.
To determine whether there are temporal or spatial trends in hospitalization-related death.
In this study, the primary outcome of interest was death after hospital discharge. I chose to predict mortality within a short-term window, namely the 30 days following hospital discharge. This outcome was derived for each subject and used in the subsequent model development.
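The outcome definition can be illustrated with a minimal Python sketch. The function name, the argument names, and the boundary handling (whether day 0 and day 30 count as inside the window) are assumptions made for illustration and are not taken from the study code.

```python
from datetime import date
from typing import Optional

WINDOW_DAYS = 30  # the short-term window named in the text


def died_within_window(discharge: date, death: Optional[date],
                       window_days: int = WINDOW_DAYS) -> int:
    """Return 1 if death occurred within `window_days` after discharge, else 0."""
    if death is None:
        # No recorded death: the subject is treated as a control.
        return 0
    days = (death - discharge).days
    # Assumption: a death on the discharge day or on day `window_days` counts;
    # a death dated before discharge does not.
    return int(0 <= days <= window_days)
```

For example, `died_within_window(date(2010, 1, 1), date(2010, 1, 20))` gives 1, while a death two months after discharge gives 0.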
Data from CTS self-reported questionnaires and electronic health records during calendar years 2000-2015 were included in this study. Predictors covered demographic characteristics (age, race, height, weight, family history, physical activity, diet, alcohol and tobacco use) and hospitalization records (admission type, length of stay, major diagnosis, planned/unplanned visit, clinical diagnoses and procedures).
Diagnosis and procedure predictors were coded using the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM: 001-999, plus E and V codes). To generalize across codes, the first three digits of each code were taken and grouped according to the ICD-9 classification of diseases, giving 19 categories for each diagnosis and procedure predictor. The cause-of-death predictor was coded using the 10th revision (ICD-10-CM: A00-Z99) and was categorized into 21 groups according to the first three characters of the ICD-10 code. Diagnosis and procedure codes were also categorized under the Clinical Classifications Software (CCS) system: the top 20 CCS codes were kept as separate categories and the remaining CCS codes were grouped into an “other” category.
Numeric predictors, such as “age_at_baseline”, “height_q1”, “weight_q1”, “bmi_q1”, and “length_of_stay”, were rescaled to the range 0 to 1. The other predictors were treated as binary or categorical indicators.
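Rescaling to the range 0 to 1 is commonly done with min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that this is the rescaling meant here; the function name is hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_scale([1, 3, 5])` returns `[0.0, 0.5, 1.0]`.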
Missing values on continuous predictors were imputed with the median of that predictor. Missing values on categorical predictors were coded as “00” to form a separate category. Some missing values co-occurred within the same person: for instance, non-smokers had missing values on the smoking-related features. These missing values were therefore imputed with 0 rather than the median. Redundant covariates and covariates with 90% missing data were excluded.
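The three imputation rules described above can be sketched as follows. Missing values are represented by `None`, and all function names are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries of a continuous predictor with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_category(values, missing_label="00"):
    """Code missing entries of a categorical predictor as their own category."""
    return [missing_label if v is None else v for v in values]


def impute_zero(values):
    """Replace missing entries with 0, e.g. smoking features for non-smokers."""
    return [0 if v is None else v for v in values]
```

For example, `impute_median([1, None, 3, 10])` returns `[1, 3, 3, 10]`, because the median of the observed values 1, 3, and 10 is 3.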
There were 132538 subjects and 87 variables in the full data set. The data set was randomly split into a 70% training set and a 30% test set. Logistic regression was used for feature selection: covariates with a p-value below 0.1 were retained for classification.
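A random 70/30 split of subject identifiers can be sketched as below. The seed value and the function name are arbitrary choices for illustration; the report does not state how the split was seeded.

```python
import random


def train_test_split_ids(ids, train_fraction=0.7, seed=2024):
    """Shuffle the ids reproducibly and cut them into training and test sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Splitting on subject identifiers rather than on rows keeps each subject entirely in one set, which matters if a subject can contribute more than one record.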
A random forest was fitted with the randomForest package, using strata to keep the same proportion of y and sampsize to balance the samples. For bagging classification, mtry was set equal to the number of features, with all other parameters at their default values. XGBoost classification was run with default parameter settings. For each model, the top 20 predictors by importance score were reported separately.
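The idea behind combining strata with sampsize is that each tree is grown on a sample containing a fixed number of observations from each outcome class, so the rare class is not swamped by controls. The Python sketch below illustrates only that sampling step, under the assumption of sampling with replacement; it is not the randomForest implementation, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, per_class, rng):
    """Draw `per_class` row indices, with replacement, from each outcome class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    sample = []
    for y in sorted(by_class):
        sample.extend(rng.choices(by_class[y], k=per_class))
    return sample
```

With 5 cases and 95 controls, `balanced_bootstrap(labels, 5, random.Random(0))` returns 10 indices, 5 from each class, and one such sample would be drawn per tree.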
Finally, I fitted an additional random forest model restricted to the survey predictors to explore the importance of the baseline characteristics. A further random forest using only the diagnosis and cause-of-death predictors was fitted to explore the predictive potential of the medical records system.
The area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity on the test set were reported as the evaluation metrics.
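For reference, the three metrics can be computed as in the sketch below. The AUC is written as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. This pairwise form is quadratic in the sample size and is meant as an illustration, not as the routine used in the analysis; the function names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of case-control pairs in which the case scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(predicted, labels):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```

Sensitivity and specificity depend on the probability threshold used to turn scores into predicted classes, whereas the AUC summarizes performance over all thresholds.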
The mortality variable showed that 51.6% of subjects died after discharge. A time window variable was created to record the number of days between discharge and death. The mean time window was 1409 days and the longest was 6739 days. The time windows were split into 6 categories: less than 30 days, less than 180 days, less than 1 year, less than 5 years, less than 10 years, and 10 years or more.
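The six time-window categories can be sketched as a simple binning of the number of days between discharge and death. The cut-points in days (a year taken as 365 days) and the labels are assumptions for illustration.

```python
import bisect

# Exclusive upper bounds of the first five categories, in days.
_UPPER_DAYS = [30, 180, 365, 5 * 365, 10 * 365]
_LABELS = ["<30 days", "<180 days", "<1 year",
           "<5 years", "<10 years", ">=10 years"]


def time_window_category(days):
    """Return the label of the first category whose upper bound exceeds `days`."""
    return _LABELS[bisect.bisect_right(_UPPER_DAYS, days)]
```

Under these cut-points the mean time window of 1409 days falls in the "<5 years" category and the longest, 6739 days, in ">=10 years".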
An age-at-death variable was created to describe the age distribution of subjects who died after discharge. The median age at death was 86 years. Race was recorded in 6 categories across all participants. 90% of participants were white, and the death rate among white participants was approximately 50%.
The median length of stay across all participants was 3 days, and the majority of participants stayed in hospital for less than 5 days. Subjects with a short time window (less than 30 days) had a higher proportion of long hospital stays than subjects with a long time window (more than 10 years). The longest hospital stay was 1792 days. For the admission type variable, scheduled and unscheduled admissions together accounted for more than 99% of admissions. Unscheduled admissions had the highest death rate.
The major diagnosis variable was divided into 25 diagnosis groups. The top three major diagnoses among deceased subjects were diseases and disorders of the circulatory system, the musculoskeletal system, and the respiratory system. The top three among surviving subjects were diseases and disorders of the musculoskeletal system, the respiratory system, and the digestive system.
A month variable was created to record the month of death among deceased subjects. The number of deaths was similar across the months of the year.
The distribution of deaths was plotted by facility ZIP code. A data set containing each facility's ZIP code and the corresponding city latitude and longitude was created to map the distribution of deaths across all hospitals. San Francisco County, Los Angeles County, and Orange County were the three counties with the highest population density. The map showed little difference in the distribution of deaths across hospitals.
Subjects who died less than 30 days after discharge were selected as the target population for prediction. There were 8507 such subjects among the 68439 patients who died after hospital discharge during 2000-2015. The analysis data set therefore included 8507 cases and 124037 controls. A logistic regression model was fitted to select features with a significant p-value (p<0.1). In total, 35 predictors were included in the subsequent model development.
## p-value
## menarche_age1 1.442209e-02
## menarche_age2 4.924301e-02
## menarche_age4 3.155844e-02
## menarche_age5 4.448109e-02
## menarche_age7 2.872133e-02
## menarche_age8 1.683667e-02
## menarche_age9 1.216377e-03
## preg_total_q113 1.204410e-03
## meno_stattype1 9.252140e-06
## meno_stattype5 4.799140e-02
## height_q1 3.836846e-07
## weight_q1 7.742539e-04
## bmi_q1 6.212902e-05
## allex_hrs_q1 1.010715e-02
## smoke_expocat1 2.636038e-02
Random forest, bagging, and XGBoost models were used for classification. Table 1 shows that the best model of each type performed very similarly on the test set: test AUCs ranged from 0.928 to 0.937, and sensitivity and specificity were also close. The difference in predictive performance between the 3 models was very small.
Model | AUC | Sensitivity | Specificity |
---|---|---|---|
Random Forest | 0.928 | 0.903 | 0.963 |
Bagging | 0.935 | 0.903 | 0.963 |
XGBoost | 0.937 | 0.913 | 0.963 |
Bar plots for each model showed the top 20 predictors by importance. There was substantial overlap in the top-ranked predictors across the 3 models. Patient disposition and age at death were the two most important predictors in all three models (accounting for almost 90% of model performance), followed by cause of death and medical diagnosis. The other categories of predictors, such as height, weight, BMI, smoking, and reproductive history, were much less important but still contributed to predictive performance.
To further assess the predictive potential of the medical diagnosis categories and the cause of death, the categorical major diagnosis and cause-of-death predictors were selected and coded as dummy indicators. The importance plot indicated that circulatory system diseases, neoplasms, and respiratory system diseases were the top three causes of death. In addition, infectious, respiratory, and nervous system diseases were the top three diagnoses related to death.
An additional model using only the survey predictors was fitted to explore the predictive potential of the self-reported questionnaire data on health and lifestyle. The AUC for predicting death within 30 days was 0.492 (95% CI: 0.481-0.5039), suggesting limited value of the survey predictors.
In this study, I fitted three machine learning models to assess the potential of the CTS self-reported questionnaire data, combined with the OSHPD hospitalization records, for relatively short-term prediction of mortality. There was substantial overlap in the top-ranked predictors across models. Patient disposition and age were the two most important features in all three models. Patient disposition accounted for the majority of predictive performance, and the other features were relatively less important. The AUC values of the additional models fitted on the survey data and the diagnosis data ranged from 0.4 to 0.5, much worse than the main models. I therefore think that a model weighted so heavily toward patient disposition is not yet suitable for generalization to other systems.
Predicting mortality is a complex task because many features are involved. Whether existing or novel predictors can improve predictive accuracy and generalizability requires further investigation.